Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce ChannelExecutor #4882

Conversation

Aaronontheweb
Copy link
Member

Working on deprecating the ole' ForkJoinDispatcher and thought I'd try using System.Threading.Channels - results are pretty disappointing so far. Current built-in implementations of the ThreadPool dispatchers absolutely stomp these ones in terms of performance.

<table>
<thead><tr><th> Method</th><th>Mean</th><th>Error</th><th>StdDev</th>
</tr>
</thead><tbody><tr><td>Actor_ping_pong_single_pair_in_memory</td><td>382.4 ns</td><td>9.51 ns</td><td>5.66 ns</td>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This number looked suspiciously good, so I decided to check it against several other benchmarks - for whatever reason Channels did well here but got absolutely destroyed in the original PingPong benchmark as well as in the NBench suite.

@@ -57,6 +61,11 @@ public static Config CreateActorSystemConfig(string actorSystemName, string ipOr
port = 0
hostname = ""localhost""
}

default-remote-dispatcher = {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RemotePingPong numbers running two different Channel<T> dispatchers were about half of what they are in dev

});

for (var i = 0; i < degreeOfParallelism; i++)
Task.Run(() => ReadChannel(_channel.Reader));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this is the biggest performance problem, how the ChannelReader<IRunnable> actually performs consumption via the ThreadPool (or in this case, the default TaskScheduler) but this is how a lot of other projects and implementations do it. I suspect their performance requirements aren't nearly as high as ours.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at how UnboundedChannel is implemented. The basic approach is that it's a ConcurrentQueue<> + a deque with lock for awaiting readers subscription.

If I understand correctly major incentive over just dispatching tasks to threadpool is control over degreeOfParallelism.

You could probably just replace channel with concurrent queue with asynchronous version of AutoResetEvent (it can have very simple construction as it's never awaited more than once).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the basic goal is to bound the degree of parallelism for multiple workloads in order to prevent starvation of /system and /remote actors - either the queue has to exist at the level of the OS' thread scheduler (which is what happens today) or it has to exist above it in the form of a virtual work queue. Our working theory is that we're probably better off letting the .NET ThreadPool manage the former and for us to apply an abstraction above the latter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably just replace channel with concurrent queue with asynchronous version of AutoResetEvent (it can have very simple construction as it's never awaited more than once).

I'll give that a try!

@Aaronontheweb
Copy link
Member Author

Made some changes to how we execute the read-side of the ChannelDispatcher and it is now significantly faster. Have some more tweaks I want to make, but this is progress.

@Aaronontheweb
Copy link
Member Author

So latest results from the benchmarks:

  1. PingPong benchmark - need to re-run these, but replacing the default dispatcher and the ForkJoinDispatcher with a ChannelExecutor resulted in a performance increase. Looks promising.
  2. RemotePingPong benchmark - a clean dev branch ran showed about 80-81 active threads running on my machine (16 cores) and an upper limit of 192,000 msg/s. Running with the ChannelExecutor showed 50 active threads (bear in mind this is the sum of two ActorSystems running inside the same process for both cases) but peak performance dropped to 164,000 msg/s.

This is a big performance increase from my first attempt at using System.Threading.Channels and my goal was to cut down on the number of threads waiting for work in Akka.NET - that appears to be a success so far but it's come at the cost of 20% throughput. Not sure that's worth it. Going to keep tweaking this to see if I can improve performance on it.

@Aaronontheweb
Copy link
Member Author

Related: #4537

@Aaronontheweb
Copy link
Member Author

Using https://github.com/akkadotnet/akka.net-integration-tests/tree/master/src/ClusterPingPong to measure the efficacy of these changes, running as separate .NET 5 processes on the same host machine. 2 node cluster.

I tried running these benchmarks inside docker-compose initially, but the overhead of WSL / docker Vnet / docker itself made it impossible to get a clear signal on the throughput (numbers were pretty similar for both.)

Running on bare metal itself tells a different story though.

With TaskScheduler(MaxConcurrency = VirtualCpuCount)

OSVersion:                         Microsoft Windows NT 10.0.19041.0
ProcessorCount:                    16
Actor Count:                       32
Node Count:                        2
Active Connections:                1
Msgs sent/received per connection: 200000  (2e5)
Is Server GC:                      True
Thread count per node:             45

Connections, Actors/node, Total [actor], Total [msg], Msgs/sec, Total [ms]
       1,        5,       10,  1,200,000,    179,543,    6683.62
       1,       10,       20,  2,200,000,    170,565,   12898.27
       1,       15,       30,  3,200,000,    158,513,   20187.56
       1,       20,       40,  4,200,000,    156,176,   26892.67
       1,       25,       50,  5,200,000,    156,961,   33129.11
       1,       30,       60,  6,200,000,    153,478,   40396.54

With v1.4.18 (ForkJoinExecutor and ThreadPoolExecutor)

OSVersion:                         Microsoft Windows NT 10.0.19041.0
ProcessorCount:                    16
Actor Count:                       32
Node Count:                        2
Active Connections:                1
Msgs sent/received per connection: 200000  (2e5)
Is Server GC:                      True
Thread count per node:             62

Connections, Actors/node, Total [actor], Total [msg], Msgs/sec, Total [ms]
       1,        5,       10,  1,200,000,    214,524,    5593.78
       1,       10,       20,  2,200,000,    198,502,   11082.99
       1,       15,       30,  3,200,000,    193,198,   16563.26
       1,       20,       40,  4,200,000,    188,137,   22324.13
       1,       25,       50,  5,200,000,    186,125,   27938.12
       1,       30,       60,  6,200,000,    182,807,   33915.48

Looks like the TaskSchedulerExecutor got dwarfed by our control - need to try using the ChannelExecutor next.

@Aaronontheweb
Copy link
Member Author

Made one small change:

ThreadPool.SetMinThreads(0,0) - this tells .NET to not set a minimum number of I/O and worker threads and instead allows the pool to scale up and down dynamically.

OSVersion:                         Microsoft Windows NT 10.0.19041.0
ProcessorCount:                    16
Actor Count:                       32
Node Count:                        2
Active Connections:                1
Msgs sent/received per connection: 200000  (2e5)
Is Server GC:                      True
Thread count per node:             45

Connections, Actors/node, Total [actor], Total [msg], Msgs/sec, Total [ms]
       1,        5,       10,  1,200,000,    190,670,    6293.58
       1,       10,       20,  2,200,000,    176,469,   12466.75
       1,       15,       30,  3,200,000,    185,703,   17231.78
       1,       20,       40,  4,200,000,    199,370,   21066.26
       1,       25,       50,  5,200,000,    203,247,   25584.53
       1,       30,       60,  6,200,000,    200,973,   30849.79

What a difference that makes! And these numbers were all generated at well below 100% CPU, which means that the bottleneck is now somewhere closer to the I/O pipeline:

image

@Aaronontheweb
Copy link
Member Author

Performance numbers using our control for ThreadPool.SetMinThreads(0,0) were pretty much the same - including the gradual performance degradation as load increased.

@Aaronontheweb
Copy link
Member Author

Trying this out using the System.Threading.Channels implementation instead:

OSVersion:                         Microsoft Windows NT 10.0.19041.0
ProcessorCount:                    16
Actor Count:                       32
Node Count:                        2
Active Connections:                1
Msgs sent/received per connection: 200000  (2e5)
Is Server GC:                      True
Thread count per node:             45

Connections, Actors/node, Total [actor], Total [msg], Msgs/sec, Total [ms]
       1,        5,       10,  1,200,000,    178,031,    6740.37
       1,       10,       20,  2,200,000,    170,268,   12920.74
       1,       15,       30,  3,200,000,    167,911,   19057.66
       1,       20,       40,  4,200,000,    177,717,   23632.95
       1,       25,       50,  5,200,000,    175,869,   29567.32
       1,       30,       60,  6,200,000,    193,511,   32039.38

Going to try re-running that one more time since the last figure has me scratching my head...

@Aaronontheweb
Copy link
Member Author

Looked better the second time around:

OSVersion:                         Microsoft Windows NT 10.0.19041.0
ProcessorCount:                    16
Actor Count:                       32
Node Count:                        2
Active Connections:                1
Msgs sent/received per connection: 200000  (2e5)
Is Server GC:                      True
Thread count per node:             45

Connections, Actors/node, Total [actor], Total [msg], Msgs/sec, Total [ms]
       1,        5,       10,  1,200,000,    194,142,    6181.02
       1,       10,       20,  2,200,000,    176,987,   12430.24
       1,       15,       30,  3,200,000,    179,937,   17783.94
       1,       20,       40,  4,200,000,    192,664,   21799.51
       1,       25,       50,  5,200,000,    189,140,   27492.84
       1,       30,       60,  6,200,000,    189,949,   32640.24

@Aaronontheweb
Copy link
Member Author

Aaronontheweb commented Apr 28, 2021

Numbers using the latest https://github.com/akkadotnet/akka.net-integration-tests/tree/master/src/ClusterPingPong

Docker

Using v1.4.18

seed_1          | OSVersion:                         Unix 5.4.72.2
seed_1          | ProcessorCount:                    16
seed_1          | Node Count:                        2
seed_1          | Active Connections:                1
seed_1          | Msgs sent/received per actor:      200000  (2e5)
seed_1          | Is Server GC:                      True
seed_1          | Thread count per node:             41
seed_1          |
seed_1          | Connections, Actors/node, Total [actor], Total [msg], Msgs/sec, Total [ms], Threads
seed_1          |        1,        5,             10,    1,000,000,    118,759,    8420.36,      3 -> 57
seed_1          |        1,       10,             20,    2,000,000,    133,184,   15016.80,      1 -> 51
seed_1          |        1,       15,             30,    3,000,000,    135,418,   22153.53,      1 -> 50
seed_1          |        1,       20,             40,    4,000,000,    135,450,   29530.99,      1 -> 50
seed_1          |        1,       25,             50,    5,000,000,    140,668,   35544.44,      2 -> 50
seed_1          |        1,       30,             60,    6,000,000,    133,377,   44985.23,      2 -> 50

Using TaskSchedulerExecutor w/ Bounded Concurrency in v1.4.19

seed_1          | OSVersion:                         Unix 5.4.72.2
seed_1          | ProcessorCount:                    16
seed_1          | Node Count:                        2
seed_1          | Active Connections:                1
seed_1          | Msgs sent/received per actor:      200000  (2e5)
seed_1          | Is Server GC:                      True
seed_1          | Thread count per node:             41
seed_1          |
seed_1          | Connections, Actors/node, Total [actor], Total [msg], Msgs/sec, Total [ms], Threads
seed_1          |        1,        5,             10,    1,000,000,    111,815,    8943.31,    3 -> 57
seed_1          |        1,       10,             20,    2,000,000,    136,361,   14666.87,    2 -> 51
seed_1          |        1,       15,             30,    3,000,000,    142,152,   21104.12,    2 -> 50
seed_1          |        1,       20,             40,    4,000,000,    141,230,   28322.51,    1 -> 50
seed_1          |        1,       25,             50,    5,000,000,    138,405,   36125.68,    2 -> 50
seed_1          |        1,       30,             60,    6,000,000,    135,683,   44220.70,    2 -> 50

Numbers inside Docker look very similar, including the thread counts. But I was able to verify that the ChannelExecutor is indeed running.

Bare Metal

Using v1.4.18

OSVersion:                         Microsoft Windows NT 10.0.19041.0                  
ProcessorCount:                    16                                                 
Node Count:                        2                                                  
Active Connections:                1                                                  
Msgs sent/received per actor:      200000  (2e5)                                      
Is Server GC:                      True                                               
Thread count per node:             61                                                 
                                                                                      
Connections, Actors/node, Total [actor], Total [msg], Msgs/sec, Total [ms], Threads   
       1,        5,             10,    1,000,000,    174,766,    5721.91,      5 -> 84
       1,       10,             20,    2,000,000,    179,417,   11147.18,      7 -> 83
       1,       15,             30,    3,000,000,    177,977,   16856.02,      3 -> 79
       1,       20,             40,    4,000,000,    181,384,   22052.61,      3 -> 71
       1,       25,             50,    5,000,000,    184,619,   27082.80,      2 -> 70
       1,       30,             60,    6,000,000,    176,577,   33979.39,      2 -> 68

Using TaskSchedulerExecutor w/ Bounded Concurrency in v1.4.19

OSVersion:                         Microsoft Windows NT 10.0.19041.0
ProcessorCount:                    16
Node Count:                        2
Active Connections:                1
Msgs sent/received per actor:      200000  (2e5)
Is Server GC:                      True
Thread count per node:             44

Connections, Actors/node, Total [actor], Total [msg], Msgs/sec, Total [ms], Threads
       1,        5,             10,    1,000,000,    141,820,    7051.18,    9 -> 78
       1,       10,             20,    2,000,000,    154,477,   12946.90,    9 -> 78
       1,       15,             30,    3,000,000,    177,678,   16884.40,    6 -> 65
       1,       20,             40,    4,000,000,    193,058,   20719.06,    8 -> 57
       1,       25,             50,    5,000,000,    193,859,   25791.82,    5 -> 54
       1,       30,             60,    6,000,000,    195,050,   30761.33,    5 -> 51

Much more visible results here - the TaskSchedulerExecutor wins in terms of overall performance once it has a chance to tune its ThreadPool by round 4 - but it also is producing these numbers using ~40% CPU.

@Aaronontheweb
Copy link
Member Author

Aaronontheweb commented Apr 28, 2021

Going to try this one more time using the TaskSchedulerExecutor instead, since that will save us the trouble of having to install another dependency and is largely backwards-compatible with what we have today.

edit: Whoops, all of that data was with the TaskSchedulerExecutor - need to try ChannelExecutor instead.

@Aaronontheweb
Copy link
Member Author

ChannelExecutor Performance

Docker

seed_1          | OSVersion:                         Unix 5.4.72.2
seed_1          | ProcessorCount:                    16
seed_1          | Node Count:                        2
seed_1          | Active Connections:                1
seed_1          | Msgs sent/received per actor:      200000  (2e5)
seed_1          | Is Server GC:                      True
seed_1          | Thread count per node:             41
seed_1          |
seed_1          | Connections, Actors/node, Total [actor], Total [msg], Msgs/sec, Total [ms], Threads
seed_1          |        1,        5,             10,    1,000,000,    114,908,    8702.59,    1 -> 57
seed_1          |        1,       10,             20,    2,000,000,    134,613,   14857.36,    1 -> 50
seed_1          |        1,       15,             30,    3,000,000,    140,762,   21312.48,    2 -> 50
seed_1          |        1,       20,             40,    4,000,000,    138,127,   28958.82,    2 -> 50
seed_1          |        1,       25,             50,    5,000,000,    138,427,   36119.92,    2 -> 50
seed_1          |        1,       30,             60,    6,000,000,    138,373,   43360.90,    1 -> 50

Bare Metal

OSVersion:                         Microsoft Windows NT 10.0.19041.0
ProcessorCount:                    16
Node Count:                        2
Active Connections:                1
Msgs sent/received per actor:      200000  (2e5)
Is Server GC:                      True
Thread count per node:             45

Connections, Actors/node, Total [actor], Total [msg], Msgs/sec, Total [ms], Threads
       1,        5,             10,    1,000,000,    154,427,    6475.54,    9 -> 79
       1,       10,             20,    2,000,000,    157,232,   12719.98,   13 -> 79
       1,       15,             30,    3,000,000,    165,067,   18174.39,    6 -> 61
       1,       20,             40,    4,000,000,    188,741,   21192.97,    6 -> 54
       1,       25,             50,    5,000,000,    194,141,   25754.36,    6 -> 54
       1,       30,             60,    6,000,000,    187,075,   32072.54,    8 -> 54

Looks about even with the TaskScheduler but I think the former's numbers are slightly better...

@Aaronontheweb
Copy link
Member Author

Full ChannelExecutor source code, even though I'm removing it from this PR in favor of the FixedConcurrencyTaskScheduler implementation:

internal sealed class ChannelExecutor : ExecutorService
    {
        private Channel<IRunnable> _channel;
        private CancellationTokenSource _cts = new CancellationTokenSource();

        private int _readers = 0;

        public int DegreeOfParallelism { get; }

        public ChannelExecutor(string id, int degreeOfParallelism) : base(id)
        {
            _channel = Channel.CreateUnbounded<IRunnable>(new UnboundedChannelOptions()
            {
                AllowSynchronousContinuations = true,
                SingleReader = false,
                SingleWriter = true
            });

            DegreeOfParallelism = degreeOfParallelism;
        }

        private static WaitCallback Executor = o => ReadChannel((ChannelExecutor)o);

        private static void ReadChannel(ChannelExecutor executor)
        {
            try
            {
                while ( executor._channel.Reader.TryRead(out var runnable))
                {
                    runnable.Run();
                }
            }
            catch
            {
                // suppress exceptions
            }
            finally
            {
                Interlocked.Decrement(ref  executor._readers);
            }
        }

        public override void Execute(IRunnable run)
        {
            if (_channel.Writer.TryWrite(run))
            {                
                if (_readers < DegreeOfParallelism)
                {
                    var initial = _readers;
                    var newVale = _readers + 1;
                    if (initial == Interlocked.CompareExchange(ref _readers, newVale, initial))
                    {
                        // try to start a new worker
                        ThreadPool.UnsafeQueueUserWorkItem(Executor, this);
                    }
                }
            }
        }

        public override void Shutdown()
        {
            _channel.Writer.Complete();
            _cts.Cancel();
        }
    }

@Aaronontheweb
Copy link
Member Author

Writing up some documentation for this now, but once that's done this should be good to go. Going to make this an opt-in feature as part of Akka.NET v1.4.19.

@Aaronontheweb Aaronontheweb changed the title Experiment: System.Threading.Channels dispatchers Introduce ChannelExecutor Apr 28, 2021
…ler` internally - have `ActorSystem.Create` call `ThreadPool.SetMinThreads(0,0)` to improve performance across the board.
@Aaronontheweb Aaronontheweb force-pushed the dispatch/System.Threading.Channels-impl branch from 1170ca1 to e7b40ea Compare April 28, 2021 15:30
Copy link
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Described my changes.

# Fixed number of threads to have in this threadpool
thread-count = 4
executor = fork-join-executor
fork-join-executor {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really the right configuration, rather than a hard thread count of 4.

// allows the ThreadPool to scale up / down dynamically
// by removing minimum thread count, which in our benchmarks
// appears to negatively impact performance
ThreadPool.SetMinThreads(0, 0);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The differences in the benchmarks with this and without it are massive. However, if this causes problems for users they can easily reset it by calling ThreadPool.SetMinThreads(yourValue, yourValue) after the ActorSystem is created. I doubt many users will need to do that and changing this default likely works to the benefit of most.

@@ -116,6 +116,26 @@ protected ExecutorServiceConfigurator(Config config, IDispatcherPrerequisites pr
public IDispatcherPrerequisites Prerequisites { get; private set; }
}

internal sealed class ChannelExecutorConfigurator : ExecutorServiceConfigurator
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used to configure the TaskSchedulerExecutor(id, new FixedConcurrencyTaskScheduler(MaxParallelism));

@@ -306,6 +326,8 @@ protected ExecutorServiceConfigurator ConfigureExecutor()
return new CurrentSynchronizationContextExecutorServiceFactory(Config, Prerequisites);
case "task-executor":
return new DefaultTaskSchedulerExecutorConfigurator(Config, Prerequisites);
case "channel-executor":
return new ChannelExecutorConfigurator(Config, Prerequisites);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how we actually pass the new executor into dispatcher configurations.


public override ExecutorService Produce(string id)
{
Prerequisites.EventStream.Publish(new Debug($"ChannelExecutor-[id]", typeof(FixedConcurrencyTaskScheduler), $"Launched Dispatcher [{id}] with MaxParallelism=[{MaxParallelism}]"));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug event you can listen for if you want to be certain that the ChannelExecutor is loaded.

public ChannelExecutorConfigurator(Config config, IDispatcherPrerequisites prerequisites) : base(config, prerequisites)
{
var fje = config.GetConfig("fork-join-executor");
MaxParallelism = ThreadPoolConfig.ScaledPoolSize(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-use ForkJoinDispatcher configuration block, since it expresses what we need.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might actually need to copy HOCON lines over and make them part of the executor config in the future, because having one part of the system depending on a config of another part can be a point of confusion for the end user

@@ -34,9 +34,6 @@ This pattern may seem to be very tempting to use at first, but it has several dr

Especially the last point is something you should be aware of — in general when using the Cluster Singleton pattern you should take care of downing nodes yourself and not rely on the timing based auto-down feature.

> [!WARNING]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops - this was actually from a previous PR.

@Aaronontheweb Aaronontheweb marked this pull request as ready for review April 28, 2021 15:39
Copy link
Contributor

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, just a few changes needed in the documentation

docs/articles/actors/dispatchers.md Outdated Show resolved Hide resolved
docs/articles/actors/dispatchers.md Outdated Show resolved Hide resolved
public ChannelExecutorConfigurator(Config config, IDispatcherPrerequisites prerequisites) : base(config, prerequisites)
{
var fje = config.GetConfig("fork-join-executor");
MaxParallelism = ThreadPoolConfig.ScaledPoolSize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might actually need to copy HOCON lines over and make them part of the executor config in the future, because having one part of the system depending on a config of another part can be a point of confusion for the end user

@Aaronontheweb
Copy link
Member Author

We might actually need to copy HOCON lines over and make them part of the executor config in the future, because having one part of the system depending on a config of another part can be a point of confusion for the end user

Agree, but not going to do that today since this is meant to be a drop-in replacement for ForkJoinDispatcher. Goal is to eventually purge DedicatedThreadPool out of the code base for good.

@Aaronontheweb
Copy link
Member Author

We might actually need to copy HOCON lines over and make them part of the executor config in the future, because having one part of the system depending on a config of another part can be a point of confusion for the end user

Agree, but not going to do that today since this is meant to be a drop-in replacement for ForkJoinDispatcher. Goal is to eventually purge DedicatedThreadPool out of the code base for good.

Should also point out, this is a sub-config of the current HOCON node anyway

@Arkatufus Arkatufus enabled auto-merge (squash) April 28, 2021 16:12
@Aaronontheweb
Copy link
Member Author

ThreadPool.SetMinThreads(0,0) definitely skewed some of the tests inside the TestKit - going to need to take a look at that.

@Aaronontheweb Aaronontheweb merged commit bdfc893 into akkadotnet:dev Apr 28, 2021
@Aaronontheweb Aaronontheweb deleted the dispatch/System.Threading.Channels-impl branch April 28, 2021 17:41
This was referenced Apr 28, 2021
Aaronontheweb added a commit that referenced this pull request Apr 28, 2021
* Added v1.4.19 placeholder

* close #4860 - use local deploy for TcpManager child actors. (#4862)

* close #4860 - use local deploy for TcpManager child actors.

* Use local deploy for TcpIncomingConnection.

* Use local deploy for Udp actors.

Co-authored-by: Erik Folstad <[email protected]>
Co-authored-by: Aaron Stannard <[email protected]>

* Merge pull request #4875 from akkadotnet/dependabot/nuget/Hyperion-0.9.17

Bump Hyperion from 0.9.16 to 0.9.17

* Bump Newtonsoft.Json from 12.0.3 to 13.0.1 (#4866)

Bumps [Newtonsoft.Json](https://github.com/JamesNK/Newtonsoft.Json) from 12.0.3 to 13.0.1.
- [Release notes](https://github.com/JamesNK/Newtonsoft.Json/releases)
- [Commits](JamesNK/Newtonsoft.Json@12.0.3...13.0.1)

Signed-off-by: dependabot-preview[bot] <[email protected]>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* Fix ClusterMetricsExtensionSpec racy spec

* Clean up Akka.Stream file stream (#4874)

* Make sure that FileSubscriber shuts down cleanly when it dies

* Make sure that file all sink spec release the file handle if it fails

* Supress ActorSelectionMessage with DeadLetterSuppression (migrated from akka/akka#28341) (#4889)

* for example the Cluster InitJoin message is marked with DeadLetterSuppression
  but was anyway logged because sent with actorSelection
* for other WrappedMessage than ActorSelectionMessage we shouldn't unwrap and publish
  the inner in SuppressedDeadLetter because that might loose some information
* therefore those are silenced in the DeadLetterListener instead

Better deadLetter logging of wrapped messages (migrated from akka/akka#28253)

Logging of UnhandledMessage (migrated from akka/akka#28414)
* make use of the existing logging of dead letter
  also for UnhandledMessage

Add Dropped to Akka.Actor (migrated partially from akka/akka#27160)
Log Dropped from DeadLetterListener

* add CultureInfo for Turkish OS (#4880)

* add CultureInfo for Turkish OS

added English CultureInfo to fix ToUpper function causes error on Turkish OS. "warning"->"WARNİNG"

* fix LogLevel TR char error

Co-authored-by: Aaron Stannard <[email protected]>

* Harden FileSink unit tests by using AwaitAssert to wait for file operations to complete (#4891)

* Harden FileSink unit tests by using AwaitAssert to wait for file operations to complete

* Use AwaitResult to improve readability

Co-authored-by: Aaron Stannard <[email protected]>

* Handle CoordinatedShutdown exiting-completed when not joined (#4893)

* assertion failed: Nodes not part of cluster have marked the Gossip as seen
* trying to mark the Gossip as seen before it has joined, which may happen
  if CoordinatedShutdown is running before the node has joined

migrated from akka/akka#26835

* Persistence fixes (#4892)

* snapshot RecoveryTick ignored, part of akka/akka#20753

* lastSequenceNr should reflect the snapshot sequence and not start with 0 when journal is empty. Migrated from akka/akka#27496

* Enforce valid seqnr for deletes, migrated from akka/akka#25488

* api approval

* Added DoNotInherit annotation (#4896)

* Bump Microsoft.NET.Test.Sdk from 16.9.1 to 16.9.4 (#4894)

* Add Setup class for NewtonSoftJsonSerializer (#4890)

* Add Setup class for NewtonSoftJsonSerializer

* Use Setup as a settings modifier instead of a settings factory

* Update spec

* Update API Approval list

* Unit test can inject null ActorSystem into the serializer causing the Setup system to throw a NRE

* Add documentation

* fixed up copyright headers (#4898)

* Bump Google.Protobuf from 3.15.6 to 3.15.7 (#4900)

Bumps [Google.Protobuf](https://github.com/protocolbuffers/protobuf) from 3.15.6 to 3.15.7.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/master/generate_changelog.py)
- [Commits](protocolbuffers/protobuf@v3.15.6...v3.15.7)

Signed-off-by: dependabot-preview[bot] <[email protected]>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* Added PhiAccrualFailureDetector warning logging for slow heartbeats (#4897)

Ported from akka/akka#17389 and akka/akka#24701

* replace reflection magic in MNTR with reading of `MultiNodeConfig` properties (#4902)

* close #4901 - replace reflection magic in MNTR with reading of MultiNodeConfig properties

* fixed outdated DiscoverySpec

* fixed SBR logging error that blew up StandardOutLogger (#4909)

This format error would cause the StandardOutLogger to throw a `FormatException` internally

* added timestamp to node failures in MNTR (#4911)

* cleaned up the SpecPass / SpecFail messages (#4912)

* reduce allocations inside PhiAccrualFailureDetector (#4913)

made `HeartbeatHistory` into a `readonly struct` and cleaned up some other old LINQ calls inside the data structure

* Bump Microsoft.Data.SQLite from 5.0.4 to 5.0.5 (#4914)

* [MNTR] Add include and exclude test filter feature (#4916)

* Add -Dmultinode.include and -Dmultinode.exclude filter feature

* Add documentation

* Fix typos and makes sentences more readable

* Make the sample command line wrap instead of running off the screen

* Change include and exclude filtering by method name instead (requested)

* cleaned up RemoteWatcher (#4917)

* Fixed System.ArgumentNullException in Interspase operation on empty stream finish. (#4918)

* Rewrite the AkkaDiFixture so that it does not need to start a HostBuilder (#4920)

* Fix case where PersistenceMessageSerializer.FromBinary got a null for its type parameter (#4923)

* Bump Google.Protobuf from 3.15.7 to 3.15.8 (#4927)

Bumps [Google.Protobuf](https://github.com/protocolbuffers/protobuf) from 3.15.7 to 3.15.8.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/master/generate_changelog.py)
- [Commits](protocolbuffers/protobuf@v3.15.7...v3.15.8)

Signed-off-by: dependabot-preview[bot] <[email protected]>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* close #4096 - documented how to terminate remembered entities (#4928)

Updated the Akka.Cluster.Sharding documentation to explain how to terminate remembered-entities.

* Add CLI switches to show help and version number (#4925)

* cleaned up protobuf CLI and definitions (#4930)

* cleaned up protobuf CLI and definitions

- Remove all `optional` fields (not allowed in Protobuf3, as all fields are optional by default unless specifically defined as `required`)
- Removed `--experimental_allow_proto3_optional` call from `protoc` compiler as it's no longer supported / needed
- Doesn't have any impact on existing wire formats, especially for `ClusterMessages.proto` which is where I removed all of the `optional` commands

* fixed compilation error caused by change in generated `AppVersion` output

* Fix MNTK specs for DData: DurablePruningSpec (#4933)

* Powershell splits CLI arguments on "." before passing them into applications (#4924)

* porting Cluster heartbeat timings, hardened Akka.Cluster serialization (#4934)

* porting Cluster heartbeat timings, hardened Akka.Cluster serialization

port akka/akka#27281
port akka/akka#25183
port akka/akka#24625

* increased ClusterLogSpec join timespan

Increased the `TimeSpan` here to 10 seconds in order to prevent this spec from failing racily, since even an Akka.Cluster self-join can take more than the default 3 seconds due to some of the timings involved in node startup et al.

* Bump Hyperion from 0.9.17 to 0.10.0 (#4935)

Bumps [Hyperion](https://github.com/akkadotnet/Hyperion) from 0.9.17 to 0.10.0.
- [Release notes](https://github.com/akkadotnet/Hyperion/releases)
- [Changelog](https://github.com/akkadotnet/Hyperion/blob/dev/RELEASE_NOTES.md)
- [Commits](akkadotnet/Hyperion@0.9.17...0.10.0)

Signed-off-by: dependabot-preview[bot] <[email protected]>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* Add spec for handling delegates in DI (#4922)

* Add spec for handling delegates in DI

* Make sure spec exits cleanly by terminating the actor system.

* Add spec where singleton delegate is called from another actor

* Fix racy test

Co-authored-by: Aaron Stannard <[email protected]>

* Bump FsCheck from 2.15.1 to 2.15.2 (#4939)

* ClusterStressSpec and Cluster Failure Detector Cleanup (#4940)

* implementation of Akka.Cluster.Tests.MultiNode.StressSpec

* made MuteLog overrideable in Akka.Cluster.Testkit

* if Roles is empty, then don't run the thunk on any nodes

Changed this to make it consistent with the JVM

* made it possible to actually enable Cluster.AssertInvariants via environment variable

* added assert invariants to build script

cleaned up gossip class to assert more invariants

* ReSharper'd Reachability.cs

* cleaned up immutability and CAS issues inside DefaultFailureDetectorRegistry

added bugfix from akka/akka#23595

* Bump FsCheck.Xunit from 2.15.1 to 2.15.2 (#4938)

Bumps [FsCheck.Xunit](https://github.com/fsharp/FsCheck) from 2.15.1 to 2.15.2.
- [Release notes](https://github.com/fsharp/FsCheck/releases)
- [Changelog](https://github.com/fscheck/FsCheck/blob/master/FsCheck%20Release%20Notes.md)
- [Commits](fscheck/FsCheck@2.15.1...2.15.2)

Signed-off-by: dependabot-preview[bot] <[email protected]>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* cleanup `AKKA_CLUSTER_ASSERT` environment variable (#4942)

Per some of the suggestions on #4940 PR review

* harden Akka.DependencyInjection.Tests (#4945)

Added some `AwaitAssert` calls to check for disposed dependencies - these calls can be racy due to background actor thread calling `Dipose` after foreground test thread checks the `Disposed` property.

* HeartbeatNodeRing performance (#4943)

* added benchmark for HeartbeatNodeRing performance

* switched to local function

No perf change

* approve Akka.Benchmarks friend assembly for Akka.Cluster

* remove HeartbeatNodeRing.NodeRing() allocation and make field immutable

* made it so Akka.Util.Internal.ArrayExtensions.From no longer allocates (much)

* added some descriptive comments on HeartbeatNodeRing.Receivers

* Replaced `Lazy<T>` with `Option<T>` and a similar lazy initialization check

 Improved throughput by ~10% on larger collections and further reduced memory allocation.

* changed return types to `IImmutableSet`

Did this in order to reduce allocations from constantly converting back and forth from `ImmutableSortedSet<T>` and `ImmutableHashSet<T>` - that way we can just use whatever the underlying collection type is.

* added ReachabilityBenchmarks

* modified PingPong / RemotePingPong benchmarks to display threadcount (#4947)

Using this to gauge the impact certain dispatcher changes have on the total number of active threads per-process

* Configure duration for applying `MemberStatus.WeaklyUp`  to joining nodes (#4946)

* Configure duration for applying `MemberStatus.WeaklyUp`  to joining nodes

port of akka/akka#29665

* fixed validation check for TimeSpan duration passed in via HOCON

* harden ClusterLogSpecs

* restored Akka.Cluster model-based FsCheck specs (#4949)

* added `VectorClock` benchmark (#4950)

* added VectorClock benchmark

* fixed broken benchmark comparisons

* Performance optimize `VectorClock` (#4952)

* Performance optimize `VectorClock`

* don't cache MD5, but dispose of it

* guarantee disposal of iterators during VectorClock.Compare

* switch to local function for `VectorClock.CompareNext`

* fixed a comparison bug in how versions where compared

* minor cleanup

* replace `KeyValuePair<TKey,TValue>` with `ValueTuple<TKey,TValue>`

Reduced allocations by 90%, decreased execution time from 100ms to ~40ms

* harden RestartFirstSeedNodeSpec (#4954)

* harden RestartFirstSeedNodeSpec

* validate that we have complete seed node list prior to test

* Turned `HeatbeatNodeRing` into `struct` (#4944)

* added benchmark for HeartbeatNodeRing performance

* switched to local function

No perf change

* approve Akka.Benchmarks friend assembly for Akka.Cluster

* remove HeartbeatNodeRing.NodeRing() allocation and make field immutable

* made it so Akka.Util.Internal.ArrayExtensions.From no longer allocates (much)

* added some descriptive comments on HeartbeatNodeRing.Receivers

* Replaced `Lazy<T>` with `Option<T>` and a similar lazy initialization check

 Improved throughput by ~10% on larger collections and further reduced memory allocation.

* changed return types to `IImmutableSet`

Did this in order to reduce allocations from constantly converting back and forth from `ImmutableSortedSet<T>` and `ImmutableHashSet<T>` - that way we can just use whatever the underlying collection type is.

* converted `HeartbeatNodeRing` into a `struct`

improved performance some, but I don't want to lump it in with other changes just in case

* Add generalized crossplatform support for Hyperion serializer. (#4878)

* Add the groundwork for generalized crossplatform support.

* Update Hyperion to 0.10.0

* Convert adapter class to lambda object

* Add HyperionSerializerSetup setup class

* Add unit test spec

* Improve specs, add comments

* Add documentation

* Add copyright header.

* Change readonly fields to readonly properties.

* cleaned up `ReceiveActor` documentation (#4958)

* removed confusing and conflicting examples in the `ReceiveActor` documentation
* Embedded reference to "how actors restart" YouTube video in supervision docs

* updated website footer to read 2021 (#4959)

* added indicator for `ClusterResultsAggregator` in `StressSpec` logs (#4960)

Did this to make it easier to search for output logs produced during each phase of the `StressSpec`

* Bump Hyperion from 0.10.0 to 0.10.1 (#4957)

* Bump Hyperion from 0.10.0 to 0.10.1

Bumps [Hyperion](https://github.com/akkadotnet/Hyperion) from 0.10.0 to 0.10.1.
- [Release notes](https://github.com/akkadotnet/Hyperion/releases)
- [Changelog](https://github.com/akkadotnet/Hyperion/blob/dev/RELEASE_NOTES.md)
- [Commits](akkadotnet/Hyperion@0.10.0...0.10.1)

Signed-off-by: dependabot-preview[bot] <[email protected]>

* Fix dependabot Hyperion issue (#4961)

* Update Akka.Remote.Tests.csproj to use common.props

* Update HyperionSerializer to reflect recent hyperion changes

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
Co-authored-by: Gregorius Soedharmo <[email protected]>

* Perf optimize `ActorSelection` (#4962)

* added memory metrics to `ActorSelection` benchmarks

* added ActorSelection benchmark

* ramped up the iteration counts

* validate that double wildcard can't be used outside of leaf node

* improve allocations on create

* minor cleanup

* create emptyRef only when needed via local function

* made `Iterator` into `struct`

* approved public API changes

* `Reachability` performance optimziation (#4955)

* reduced iteration count to speed up benchmarks

* optimize some System.Collections.Immutable invocations to allocate less

* cleanup dictionary construction

* fixed multiple enumeration bug in `Reachability`

* Fix `SpawnActor` benchmark (#4966)

* cleaned up SpawnActorBenchmarks

* cleaned up SpawnActor benchmarks

* fixed N-1 error inside `Mailbox` (#4964)

This error has no impact on extremely busy actors, but for actors who have to process small bursts of messages this can make the difference between getting everything done in one dispatch vs. doing it in two.

* Clean up bad outbound ACKs in Akka.Remote (#4963)

port of akka/akka#20093

Might be responsible for some quarantines in Akka.Cluster / Akka.Remote when nodes are restarting on identical addresses.

* UnfoldResourceSource closing twice on failure (#4969)

* Added test cases where close would be called twice

* Bugfix UnfoldResource closed resource twice on failure

* Add retry pattern with delay calculation support (#4895)

Co-authored-by: Aaron Stannard <[email protected]>

* simplified the environment variable name for StressSpec (#4972)

* Refactored `Gossip` into `MembershipState` (#4968)

* refactor Gossip class into `MembershipState`

port of akka/akka#23291

* completed `MembershipState` port

* fixed some downed observers calls

* forgot to copy gossip upon `Welcome` from Leader

* forgot to copy `MembershipState` while calling `UpdateLatestGossip`

* refactored all DOWN-ing logic to live inside `Gossip` class

* added some additional methods onto `MembershipState`

* fixed ValidNodeForGossip bug

* fixed equality check for Reachability

 should be quality by reference, not by value

Co-authored-by: Gregorius Soedharmo <[email protected]>

* Fix serialization verification problem with Akka.IO messages (#4974)

* Fix serialization verification problem with Akka.IO messages

* Wrap naked SocketAsyncEventArgs in a struct that inherits INoSerializationVerificationNeeded

* Make the wrapper struct readonly

* Expand exception message with their actor types

* Update API Approver list

* ORDictionary with POCO value missing items, add MultiNode spec (#4910)

* Update DData reference HOCON config to follow JVM

* Clean up ReplicatorSettings, add sensible default values.

* Add DurableData spec that uses ORDictionary with POCO values

* Add a special case for Replicator to suppress messages during load

* Slight change in public API, behaviour is identical.

* Change Replicator class to UntypedActor

* Code cleanup - Change fields to properties

* Update API Approver list

* clean up seed node process (#4975)

* fixed racy ActorModelSpec (#4976)

fixed `A_dispatcher_must_handle_queuing_from_multiple_threads` - we were using the wrong message type the entire time, and the previous instance caused `Thread.Sleep` to be called repeatedly.

* Update PluginSpec so that it can accept ActorSystem and ActorSystemSetup in its constructor (#4978)

* Removed inaccurate warning from Cluster Singleton docs (#4980)

Cluster singletons won't create duplicates in a cluster split scenario, and it's much safer to run them _with_ a split brain resolver on than without. This documentation was just out of date.

* Introduce `ChannelExecutor` (#4882)

* added `ChannelExecutor` dispatcher - uses `FixedConcurrencyTaskScheduler` internally - have `ActorSystem.Create` call `ThreadPool.SetMinThreads(0,0)` to improve performance across the board.

* fixed documentation errors

* Upgrade to GitHub-native Dependabot (#4984)

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* added v1.4.19 release notes (#4985)

* added v1.4.19 release notes

Co-authored-by: Erik Følstad <[email protected]>
Co-authored-by: Erik Folstad <[email protected]>
Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
Co-authored-by: Igor <[email protected]>
Co-authored-by: Gregorius Soedharmo <[email protected]>
Co-authored-by: zbynek001 <[email protected]>
Co-authored-by: Cagatay YILDIZOGLU <[email protected]>
Co-authored-by: Ismael Hamed <[email protected]>
Co-authored-by: Anton V. Ilyin <[email protected]>
Co-authored-by: Arjen Smits <[email protected]>
@Zetanova
Copy link
Contributor

I would not recommend to set ThreadPool.SetMinThreads(0,0) or do any global ThreadPool settings.
Why set both limits work and IO threads?

I don't realy understand the effect that would have other then the ThreadPool schedule algo would be used instantly
and the first messages/IRunnable would be a bit delayed.

An idea would be first to add a totalReaders counter to count how may readers got called over the test-run.
Less would be better. I believe only to set ThreadPool.SetMinThreads(0,0) would have the effect to reduce the totalReaders count.
And other counters to count worker executions that got 0, 1 or many work-items.
Counters: totalReaders, zeroReader, singleReaders

To have the same effect as ThreadPool.SetMinThreads(0,0) a wrapped IRunnable could be used to start the next worker from one of the currently running wrokers.

@Aaronontheweb
Copy link
Member Author

I would not recommend to set ThreadPool.SetMinThreads(0,0) or do any global ThreadPool settings.
Why set both limits work and IO threads?

This setting removes the lower bound on the number of minimum threads - now there are no limits, as the lower bound is zero. The previous value was Environment.ProcessorCount, which as our benchmarks showed was probably more than even most busy systems need.

In the event of an extremely busy system, this change probably won't have much impact as those threads will continue to be used. The one downside might be that a bunch of threads might be killed and recreated right away at startup, and Ben Adams pointed out some potential memory overhead from work-stealing that might be incurred by this change too: https://twitter.com/ben_a_adams/status/1387236911018819596

If the setting has adverse affects for a lot of our users, we'll roll it back and make it toggleable - but changing that setting improved all of the Akka.NET v1.4.18 defaults too.

An idea would be first to add a totalReaders counter to count how may readers got called over the test-run.
Less would be better. I believe only to set ThreadPool.SetMinThreads(0,0) would have the effect to reduce the totalReaders count.
And other counters to count worker executions that got 0, 1 or many work-items.
Counters: totalReaders, zeroReader, singleReaders

We'd need to keep track of how busy all of the current readers are and factor that in too, the time component. But I'm open to auto-tuning our use of workers on the thread pool as well. Definitely interested in doing more work in that area.

@Zetanova
Copy link
Contributor

My proposed coutners would only for perf messurement.
I believe that the new work item will get often "stolen" by an other worker
and the queued up worker would run empty. zeroReaders++

Most likely the effect of ThreadPool.SetMinThreads(0,0) would be that
worker-1, worker-2, worker-3 would be queued in the ThreadPool queue
and worker-0 would process all work-items more or less by himself.

@Aaronontheweb
Copy link
Member Author

@Zetanova I'd be very interested in seeing some experiments with this construct from contributors like you - your original Channel<T> implementation was one of the inspirations for this!

@Zetanova
Copy link
Contributor

@Aaronontheweb
I will do one and test the current one at the weekend.

The refactor of the HashWheelTaskScheduler is on my todo list too.

@Aaronontheweb
Copy link
Member Author

The refactor of the HashWheelTaskScheduler is on my todo list too.

That's another good one - as @to11mtm often says, adding support for "Schedule with fixed delay" is something we should aim for by the time Akka.NET v1.5 is released.

@Zetanova
Copy link
Contributor

@Aaronontheweb I didn't used the new dotnet counters for now, but would it be not an idea to use implement them here
https://docs.microsoft.com/en-gb/dotnet/core/diagnostics/dotnet-counters
Of course about the overhead we would need to be very careful.

The benefit would be that somebody could exec the dotnet counter output on production / kubernetes cluster
without a spezial build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants